Abstract: Rooted triplets are becoming one of the important types of input for reconstructing
rooted phylogenies. A rooted triplet is a phylogenetic tree on three leaves and shows the
evolutionary relationship of the corresponding three species. In this paper, we investigate the
problem of inferring the maximum consensus evolutionary tree from a set of rooted triplets.
This problem is known to be APX-hard. We present two new heuristic algorithms.
For a given set of $m$ triplets on $n$ species, the FastTree algorithm runs in $O(m + \alpha(n)n^2)$
time, where $\alpha(n)$ is the functional inverse of Ackermann's function. This is faster than any
previously known algorithm, although the outcome is less satisfactory. The BPMTR
algorithm runs in $O(mn^3)$ time and on average performs better than any previously
known algorithm for this problem.
1. Introduction

After the publication of Charles Darwin's book On the Origin of Species by Means of Natural Selection, the theory of evolution was widely accepted. Since then, remarkable developments in evolutionary studies have led scientists to phylogenetics, a field that studies the biological or morphological data of species to output a mathematical model, such as a tree or a network, representing the evolutionary interrelationship of species and the process of their evolution. Phylogenetics is not limited to biology; it may arise wherever the concept of evolution appears. For example, a recent study in evolutionary linguistics employs phylogeny inference to clarify the origin of the Indo-European language family [1]. Several approaches have been introduced to infer evolutionary relationships [2]. Among those, well-known approaches are character-based methods (e.g., Maximum Parsimony), distance-based methods (e.g., Neighbor Joining and UPGMA) and quartet-based methods (e.g., QNet). Recently, newer approaches, namely triplet-based methods, have been introduced. Triplet-based methods output rooted trees and networks due to the rooted nature of triplets. A rooted triplet is a rooted, unordered, leaf-labeled binary tree on three leaves and shows the evolutionary relationship of the corresponding three species. Triplets can be obtained accurately using a maximum likelihood method such as the one introduced by Chor et al. [3] or Sibley-Ahlquist-style DNA-DNA hybridization experiments [4]. Indeed, we expect highly accurate results from triplet-based methods. However, sometimes, due to experimental errors or biological events such as hybridization (recombination) or horizontal gene transfer, it is not possible to reconstruct a tree that satisfies all of the input constraints (triplets). There are two approaches to overcome this problem. The first is to employ a more complex model such as a network, which is the proper approach when the mentioned biological events have actually happened. The second tries to reconstruct a tree satisfying as many input triplets as possible. This approach is more useful when the input data contains errors. The latter approach forms the subject of this paper. In the next section we provide the necessary definitions and notation. Section 3 contains an overview of previous results. We present our algorithms and experimental results in Section 4. Finally, in Section 5, open problems and ideas for further improvements are discussed.

*An extended abstract of this article has appeared in the Proceedings of the Annual International Conference on Bioinformatics and Computational Biology (BICB 2011) in Singapore.

2. Preliminaries

An evolutionary tree (phylogenetic tree) on a set $S$ of $n$ species, $|S| = n$, is a binary, rooted¹, unordered tree in which leaves are distinctly labeled by members of $S$ (see Fig. 1a). A rooted triplet is a phylogenetic tree on three leaves. The unique triplet on leaves $x, y, z$ is denoted by $((x, y), z)$ or $xy|z$ if the lowest common ancestor of $x$ and $y$ is a proper descendant of the lowest common ancestor of $x$ and $z$, or equivalently, if the lowest common ancestor of $x$ and $y$ is a proper descendant of the lowest common ancestor of $y$ and $z$ (see Fig. 1b). A triplet $t$ (e.g., $xy|z$) is consistent with a tree $T$ (or equivalently $T$ is consistent with $t$) if $t$ is an embedded subtree of $T$. This means that $t$ can be obtained from $T$ by a series of edge contractions (i.e., in $T$ the lowest common ancestor of $x$ and $y$ is a proper descendant of the lowest common ancestor of $x$ and $z$). We also say $T$ satisfies $t$ if $T$ is consistent with $t$. The tree in Fig. 1a is consistent with the triplet in Fig. 1b. A phylogenetic tree $T$ is consistent with a set of rooted triplets if it is consistent with every triplet in the set. We call two leaves siblings, or a cherry, if they share the same parent. For example, $\{x, y\}$ in Fig. 1a form a cherry.
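The consistency test defined above can be illustrated with a minimal sketch. It assumes a representation that is not part of the paper: a tree is a nested tuple of leaf labels, and a triplet $xy|z$ is the tuple `(x, y, z)`. The function names are hypothetical. Because the lowest common ancestors of $(x, y)$ and $(x, z)$ both lie on the path from the root to $x$, "proper descendant" reduces to "strictly deeper".

```python
from typing import Dict, Tuple, Union

# Assumed encoding: a leaf is a string, an internal node is a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def leaf_paths(tree: Tree, prefix: Tuple[int, ...] = ()) -> Dict[str, Tuple[int, ...]]:
    """Map each leaf label to its root-to-leaf path of child indices."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (index,)))
    return paths


def lca_depth(paths: Dict[str, Tuple[int, ...]], a: str, b: str) -> int:
    """Depth of the lowest common ancestor: length of the common path prefix."""
    depth = 0
    for step_a, step_b in zip(paths[a], paths[b]):
        if step_a != step_b:
            break
        depth += 1
    return depth


def is_consistent(tree: Tree, triplet: Tuple[str, str, str]) -> bool:
    """True if the tree satisfies xy|z, given as the tuple (x, y, z)."""
    x, y, z = triplet
    paths = leaf_paths(tree)
    # lca(x, y) is a proper descendant of lca(x, z) iff it is strictly deeper.
    return lca_depth(paths, x, y) > lca_depth(paths, x, z)


def is_cherry(tree: Tree, a: str, b: str) -> bool:
    """True if the distinct leaves a and b share the same parent."""
    paths = leaf_paths(tree)
    return a != b and len(paths[a]) == len(paths[b]) and paths[a][:-1] == paths[b][:-1]
```

For a hypothetical four-leaf tree `((("x", "y"), "z"), "w")`, `is_consistent(tree, ("x", "y", "z"))` holds while `is_consistent(tree, ("x", "z", "y"))` does not, and `{x, y}` is a cherry. This sketch recomputes the leaf paths on every call; it is meant to mirror the definition, not to be efficient.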

A set of triplets $R$ is called dense if for each set of three species $\{x, y, z\}$, $R$ contains at least one of the three possible triplets $xy|z$, $xz|y$ or $yz|x$. If $R$ contains exactly one triplet for each set of three species, it is called minimal dense, and if it contains every possible triplet it is called maximal dense. Now we can define the problem of reconstructing an evolutionary tree from a set of rooted triplets. Suppose $S$ is a finite set of species of cardinality $n$ and $R$ is a finite set of rooted triplets of cardinality $m$ on $S$. The problem is to find an evolutionary tree leaf-labeled by members of $S$ which is consistent with the maximum number of rooted triplets in $R$. This problem is called the Maximum Rooted Triplets Consistency (MaxRTC) problem [5] or the Maximum Inferred Local Consensus Tree (MILCT) problem [6]. This problem is NP-hard (see Section 3), which means that no polynomial-time algorithm can solve it optimally unless P=NP. For this and similar problems, one might search for polynomial-time algorithms that produce approximate solutions. We call an algorithm an approximation algorithm if its solution is guaranteed to be within some factor of the optimum solution. In contrast, heuristics may produce good solutions but do not come with a guarantee on the quality of their solutions. An algorithm for a maximization problem is called an $\alpha$-approximation algorithm, for some $\alpha > 1$, if for any input the output of the algorithm is at most $\alpha$ times worse than the optimum solution. The factor $\alpha$ is called the approximation factor or approximation ratio.

¹ More precisely speaking, an evolutionary tree can also be unrooted; however, triplet-based methods output rooted phylogenies.

Figure 1. Example of a phylogenetic tree and a consistent triplet. (a) A phylogenetic tree. (b) The triplet $xy|z$.

3. Related works

Aho et al. [7] were the first to investigate the problem of constructing a tree consistent with a set of rooted triplets. They designed a simple recursive algorithm which runs in $O(mn)$ time and returns a tree consistent with all of the given triplets if at least one such tree exists. Otherwise, it returns null. Later, Henzinger et al. [8] improved Aho et al.'s algorithm to run in $\min\{O(n + mn^{1/2}), O(m + n^2 \log n)\}$ time. The time complexity of this algorithm was further improved to $\min\{O(n + m \log^2 n), O(m + n^2 \log n)\}$ by Jansson et al. [9] using more recent data structures introduced by Holm et al. [10]. MaxRTC has been proved to be NP-hard [6,11,12]. Byrka et al. [13] reported that this proof is an L-reduction from an APX-hard problem, meaning that the problem is APX-hard in general (non-dense case). Later, Van Iersel et al. [14] proved that MaxRTC is NP-hard even if the input triplet set is dense.
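For orientation, the recursive algorithm of Aho et al. mentioned at the start of this section can be sketched as follows. The paper does not spell the algorithm out; the sketch follows the standard textbook formulation (an auxiliary graph on the current leaf set, with recursion on its connected components), and the names `build` and `Triplet`, the nested-tuple encoding and the pairwise joining of components into a binary tree are all assumptions made for illustration. It does not attempt to achieve the $O(mn)$ bound.

```python
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # assumed encoding: (x, y, z) stands for xy|z


def build(leaves: Set[str], triplets: List[Triplet]):
    """Return a nested-tuple tree consistent with all triplets, or None."""
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only the triplets whose three leaves all lie in the current leaf set.
    local = [t for t in triplets if set(t) <= leaves]
    # Auxiliary graph: connect x and y for every local triplet xy|z.
    adjacent = {leaf: set() for leaf in leaves}
    for x, y, _ in local:
        adjacent[x].add(y)
        adjacent[y].add(x)
    # Collect the connected components with a depth-first search.
    components, seen = [], set()
    for start in sorted(leaves):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacent[node] - component)
        seen |= component
        components.append(component)
    if len(components) == 1:
        return None  # the graph is connected, so no consistent tree exists
    subtrees = []
    for component in components:
        subtree = build(component, local)
        if subtree is None:
            return None
        subtrees.append(subtree)
    # Join the component trees pairwise so that the result stays binary.
    tree = subtrees[0]
    for subtree in subtrees[1:]:
        tree = (tree, subtree)
    return tree
```

For example, `build({"a", "b", "c"}, [("a", "b", "c")])` yields `(("a", "b"), "c")`, whereas adding the conflicting triplet `("a", "c", "b")` makes the auxiliary graph connected and the call returns `None`. This all-or-nothing behaviour is what the heuristics discussed in this paper are designed to avoid.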

Several heuristics and approximation algorithms have been presented for the MaxRTC problem, each of which performs better in practice on different input triplet sets. Gasieniec et al. [15] proposed two algorithms by modifying Aho et al.'s algorithm. Their first algorithm, referred to as One-Leaf-Split [5], runs in $O((m + n) \log n)$ time, and the second, referred to as Min-Cut-Split [5], runs in $\min\{O(mn^2 + n^3 \log n), O(n^4)\}$ time. The tree generated by the first algorithm is guaranteed to be consistent with at least one third of the input triplet set. This gives a lower bound for the problem. In another study, Wu [11] introduced a bottom-up heuristic approach called BPMF², which runs in $O(mn^3)$ time. In the same study he proposed an exact exponential algorithm for the problem which runs in $O((m + n^2)3^n)$ time and $O(2^n)$ space. According to the results of Wu [11], BPMF seems to perform well on average on randomly generated data. Later, Maemura et al. [16] presented a modified version of BPMF called BPMR³, which employs the same approach but with a slightly different reconstruction routine. BPMR runs in $O(mn^3)$ time and, according to Maemura et al.'s experiments, outperforms BPMF. Byrka et al. [13] designed a modified version of BPMF to achieve an approximation ratio of 3. They also investigated how MinRTI⁴ can be used to approximate MaxRTC and proved that MaxRTC admits a polynomial-time $(3 - \frac{2}{n-2})$-approximation.

4. Algorithms and experimental results

In this section we present two new heuristic algorithms for the MaxRTC problem.

4.1. FastTree

The first heuristic algorithm takes a bottom-up greedy approach and, employing a simple data structure, is faster than the previously known algorithms.

Let $R(T)$ denote the set of all triplets consistent with a given tree $T$. $R(T)$ is called the reflective triplet set of $T$. It forms a minimal dense triplet set and represents $T$ uniquely [17]. Now we define the closeness of the pair $\{i, j\}$. The closeness of the pair $\{i, j\}$, $C_{ij}$, is defined as the number of triplets of the form $ij|k$ in a triplet set. Clearly, for any tree $T$, the closeness of cherry species equals $n - 2$, which is the maximum in $R(T)$. The reason is that every cherry forms a triplet with every other species. Now suppose we contract every cherry of the form $\{i, j\}$ to its parent $p_{ij}$ and then update $R(T)$ as follows. For each contracted cherry $\{i, j\}$ we remove the triplets of the form $ij|k$ from $R(T)$ and replace $i$ and $j$ with $p_{ij}$ in the remaining triplets. The updated set, $R'(T')$, is the reflective triplet set of the new tree $T'$. Observe that for cherries of the form $\{p_{ij}, k\}$ in $T'$, $C_{ik}$ and $C_{jk}$ equal $n - 3$ in $R(T)$. Similarly, for cherries of the form $\{p_{ij}, p_{kl}\}$ in $T'$, $C_{ik}$, $C_{jk}$, $C_{il}$ and $C_{jl}$ equal $n - 4$ in $R(T)$. This forms the main idea of the first heuristic algorithm. We first compute the closeness of pairs of species by visiting the triplets. Then, sorting the pairs according to their closeness gives us the reconstruction order of the tree. This routine outputs the unique tree $T$ for any given reflective triplet set $R(T)$. Yet we have to consider that the input triplet set is not always a reflective triplet set. Consequently, the reconstruction order produced by sorting may not be the right order. However, if the loss of triplets follows a uniform distribution, it will not affect the reconstruction order. An approximate solution to this problem is to refine the closeness. This can be done by reducing the closeness of the pairs $\{i, k\}$ and $\{j, k\}$ for every visited triplet of the form $ij|k$. Thus, if the pair $\{i, j\}$ is actually a cherry, the probability of choosing the pair $\{i, k\}$ or $\{j, k\}$ before the pair $\{i, j\}$ due to triplet loss is reduced. We call this algorithm FastTree. See Alg. 1 for the whole algorithm.

² Best Pair Merge First
³ Best Pair Merge with Reconstruction
⁴ Minimum Rooted Triplet Inconsistency

Algorithm 1 FastTree

 1: Initialize a forest $F$ consisting of $n$ one-node trees labeled by species.
 2: for each triplet of the form $ij|k$ do
 3:     $C_{ij} \leftarrow C_{ij} + 1$
 4:     $C_{ik} \leftarrow C_{ik} - 1$
 5:     $C_{jk} \leftarrow C_{jk} - 1$
 6: end for
 7: Create a list $L$ of pairs of species.
 8: Sort $L$ according to the refined closeness of pairs with a linear time sorting algorithm.
 9: while $|L| > 0$ do
10:     Remove the pair $\{i, j\}$ with maximum $C_{ij}$.
11:     if $i$ and $j$ are not in the same tree then
12:         Add a new node and connect it to the roots of the trees containing $i$ and $j$.
13:     end if
14: end while
15: if $F$ has more than one tree then
16:     Merge trees in any order until only one tree remains.
17: end if
18: return the tree in $F$
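The following is a minimal Python sketch of Algorithm 1, given under several assumptions that are not taken from the paper: triplets $ij|k$ are encoded as tuples `(i, j, k)`, trees are nested tuples, the function name `fast_tree` is hypothetical, and the list $L$ contains every pair of species. For brevity it uses Python's comparison sort instead of a linear-time sort and a union-find with path halving but without union by rank, so it illustrates the reconstruction order and not the stated running time.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # assumed encoding: (i, j, k) stands for ij|k


def fast_tree(species: List[str], triplets: Iterable[Triplet]):
    """Greedy bottom-up reconstruction in the spirit of Algorithm 1 (w = p = 1)."""
    # Refined closeness: reward the pair {i, j}, penalise {i, k} and {j, k}.
    closeness: Dict[FrozenSet[str], int] = defaultdict(int)
    for i, j, k in triplets:
        closeness[frozenset((i, j))] += 1
        closeness[frozenset((i, k))] -= 1
        closeness[frozenset((j, k))] -= 1

    # Disjoint-set forest; each set representative also stores its current tree.
    parent = {s: s for s in species}
    subtree = {s: s for s in species}

    def find(s: str) -> str:
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # Visit pairs by decreasing refined closeness (comparison sort for brevity).
    pairs = sorted(combinations(species, 2),
                   key=lambda pair: closeness[frozenset(pair)], reverse=True)
    for i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Add a new node whose children are the two current trees.
            parent[root_j] = root_i
            subtree[root_i] = (subtree[root_i], subtree[root_j])
    # Every pair was visited, so the forest has already collapsed to one tree.
    return subtree[find(species[0])]
```

On the reflective triplet set of the tree `(("a", "b"), ("c", "d"))`, that is `ab|c`, `ab|d`, `cd|a` and `cd|b`, the sketch returns the same tree. Because all pairs are listed here, the final merge step of Algorithm 1 is never needed; an implementation that lists only pairs occurring in triplets would have to keep it.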

Theorem 1. FastTree runs in $O(m + \alpha(n)n^2)$ time.

Proof. Initializing a forest in step 1 takes $O(n)$ time. Steps 2-6 take $O(m)$ time. We know that the closeness is an integer value between 0 and $n - 2$. Thus, we can employ a linear time sorting algorithm [18]. There are $O(n^2)$ possible pairs; therefore, step 8 takes $O(n^2)$ time. Similarly, the while loop in step 9 runs for $O(n^2)$ iterations. Each removal in step 10 can be done in $O(1)$ time. By employing the optimal data structures used for disjoint-set unions [18], the amortized time complexity of steps 11 and 12 is $O(\alpha(n))$, where $\alpha(n)$ is the inverse of the function $f(n) = A(n, n)$ and $A$ is the well-known fast-growing Ackermann function. Furthermore, step 16 takes $O(n\alpha(n))$ time. Hence, the running time of FastTree is $O(m + \alpha(n)n^2)$. □

Since $A(4, 4) = 2^{2^{65536}}$, $\alpha(n)$ is less than 4 for any practical input size $n$. In comparison to the fast version of Aho et al.'s algorithm, FastTree employs a simpler data structure, and in comparison to Aho et al.'s original algorithm it has a smaller time complexity. Yet the most important advantage of FastTree over Aho et al.'s algorithm is that it does not get stuck if no tree is consistent with the input triplets; it outputs a proper tree whose clusters are very similar to those of the real network. The tree in Fig. 2 is the output of FastTree on a dense set of triplets based on data from the yeast Cryptococcus gattii. No tree is consistent with the whole triplet set; however, Van Iersel et al. [19] presented a level-2 network consistent with the set (see Fig. 3). This set is available online [20]. In comparison to BPMR and BPMF, FastTree runs much faster for large sets of triplets and species. However, for highly sparse triplet sets, the output of FastTree may satisfy considerably fewer triplets than the tree constructed by BPMF or BPMR.

4.2. BPMTR

Before explaining the second heuristic algorithm we need to survey BPMF [11] and BPMR [16]. BPMF utilizes a bottom-up approach similar to hierarchical clustering. Initially, there are $n$ trees, each of which contains a single node representing one of the $n$ given species. In each iteration, the algorithm computes a function called e_score for each combination of two trees. Then the two trees with the maximum e_score are merged into a single tree by adding a new node as the common parent of the selected trees. Wu [11] introduced six alternatives for computing the e_score using combinations of $w$, $p$ and $t$ (see Tab. 1). In each run, however, only one of the six alternatives is used. In the function e_score($C_1$, $C_2$), $w$ is the number of triplets satisfied by merging $C_1$ and $C_2$, which is the number of triplets of the form $ij|k$ in which $i$ is in $C_1$, $j$ is in $C_2$ and $k$ is neither in $C_1$ nor in $C_2$. The value of $p$ is the number of triplets that are in conflict with merging $C_1$ and $C_2$. It is the number of triplets of the form $ij|k$ in which $i$ is in $C_1$, $k$ is in $C_2$ and $j$ is neither in $C_1$ nor in $C_2$. The value of $t$ is the total number of triplets of the form $ij|k$ in which $i$ is in $C_1$ and $j$ is in $C_2$. Wu compared BPMF with One-Leaf-Split and Min-Cut-Split and showed that BPMF works better on randomly generated triplet sets. He also notes that none of the six alternatives of e_score is absolutely better than the others.
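The quantities $w$, $p$ and $t$ can be computed directly from their definitions, as in the hedged sketch below. The encoding of triplets as tuples `(i, j, k)`, the function names, the integer `ratio_type` index (0, 1, 2 for the three columns of the e_score table) and the convention of returning 0 when a denominator is zero are all assumptions made for illustration. The sketch also assumes that the roles of $C_1$ and $C_2$ are interchangeable, so a triplet with $i$ in $C_2$ and $k$ in $C_1$ counts towards $p$ as well.

```python
from typing import Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # assumed encoding: (i, j, k) stands for ij|k


def count_wpt(c1: Set[str], c2: Set[str], triplets: Iterable[Triplet]):
    """Count w, p and t for merging the disjoint clusters c1 and c2."""
    w = p = t = 0
    for i, j, k in triplets:
        # The pair {i, j} is unordered, so both orientations are tried.
        for a, b in ((i, j), (j, i)):
            if a in c1 and b in c2:
                t += 1
                if k not in c1 and k not in c2:
                    w += 1  # merging c1 and c2 satisfies ij|k
            if (a in c1 and k in c2) or (a in c2 and k in c1):
                if b not in c1 and b not in c2:
                    p += 1  # merging c1 and c2 conflicts with ij|k
    return w, p, t


def e_score(w: int, p: int, t: int, ratio_type: int = 0, if_penalty: bool = False) -> float:
    """One of the six alternatives: numerator w or w - p, denominator 1, w + p or t."""
    numerator = w - p if if_penalty else w
    denominator = (1, w + p, t)[ratio_type]
    # Returning 0 for an empty denominator is an assumption, not from the paper.
    return numerator / denominator if denominator else 0.0
```

For the clusters `{"a"}` and `{"b"}` and the triplets `ab|c`, `ac|b`, `ab|d` and `cd|a`, the counts are `w = 2`, `p = 1` and `t = 2`, so the alternative $(w - p)/t$ evaluates to 0.5.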

Table 1. The six alternatives of e_score

If-Penalty    Ratio Type
False         w        w/(w+p)        w/t
True          w-p      (w-p)/(w+p)    (w-p)/t

Maemura et al. [16] introduced a modified version of BPMF called BPMR, which outperforms BPMF. BPMR works very similarly to BPMF except for an additional reconstruction step. Suppose $T_x$ and $T_y$ are the two trees having the maximum e_score at some iteration and are selected to be merged into a new tree. By merging $T_x$ and $T_y$ some triplets will be satisfied, but some other triplets will be in conflict. Without loss of generality, suppose $T_x$ has two subtrees, namely a left subtree and a right subtree. Further, consider a triplet $ij|k$ in which $i$ is in the left subtree of $T_x$, $k$ is in the right subtree of $T_x$ and $j$ is in $T_y$. Observe that by merging $T_x$ and $T_y$ this triplet becomes inconsistent. However, swapping $T_y$ with the right subtree of $T_x$ satisfies this triplet, while some other triplets become inconsistent. It is possible that the tree resulting from this swap satisfies more triplets than the original tree. This is the main idea behind BPMR. In BPMR, in addition to the regular merging of $T_x$ and $T_y$, $T_y$ is swapped with the left and the right subtree of $T_x$, and $T_x$ is swapped with the left and the right subtree of $T_y$. Finally, among these five topologies we choose the one that satisfies the most triplets.

Suppose the left subtree of $T_x$ also has two subtrees. Swapping $T_y$ with one of these subtrees would probably satisfy new triplets, while some old ones would become inconsistent. There are examples in which this swap results in a tree that satisfies more triplets. This forms our second heuristic idea: swapping $T_y$ with every subtree of $T_x$ should be checked, and $T_x$ should likewise be swapped with every subtree of $T_y$. At every iteration of BPMF, after choosing the two trees maximizing the e_score, the algorithm tests every possible swap of these two trees with subtrees of each other and then chooses the tree having the maximum consistency with the triplets. We call this algorithm BPMTR⁵. See Alg. 2 for details of BPMTR.

Theorem 2. BPMTR runs in $O(mn^3)$ time.

Proof. Step 1 takes $O(n)$ time. In step 2, $T$ initially contains $n$ clusters, but in each iteration two clusters merge into one. Hence, the while loop in step 2 runs for $O(n)$ iterations. In step 3, e_score is computed for every subset of $T$ of size two. By applying Bender and Farach-Colton's preprocessing algorithm [21], which runs in $O(n)$ time for a tree with $n$ nodes, every LCA query can be answered in $O(1)$ time. Therefore, the consistency of a triplet with a cluster can be checked in $O(1)$ time. Since there are $m$ triplets, step 3 takes $\binom{|T|}{2} O(m)$ time. In steps 5, 9 and 15, $T_{best}$ is a pointer that stores, in $O(1)$ time, the best topology found so far during each iteration of the while loop. The complexity analyses of the for-each loops in steps 6-11 and 12-17 are similar, so it is enough to consider one. Every rooted binary tree with $n$ leaves has $O(n)$ internal nodes, so the total number of swaps in step 7 for any two clusters is at most $O(n - |T|)$. In step 8, computing the number of triplets consistent with $T_{swapped}$ takes no more than $O(m)$ time. Steps 4, 7 and 18 are implementable in $O(1)$ time. Accordingly, the running time of steps 2-19 is

$$\sum_{|T|=2}^{n} \left[ \binom{|T|}{2} O(m) + O(n - |T|)\, O(m) \right] = O(mn^3). \qquad (1)$$

Step 20 takes $O(1)$ time. Hence, the time complexity of BPMTR is $O(mn^3)$. □

⁵ Best Pair Merge with Total Reconstruction
Algorithm 2 BPMTR

 1: Initialize a set $T$ consisting of $n$ one-node trees labeled by species.
 2: while $|T| > 1$ do
 3:     Find and remove two trees $T_x$, $T_y$ with maximum e_score.
 4:     Create a new tree $T_{merge}$ by adding a common parent to $T_x$ and $T_y$.
 5:     $T_{best} \leftarrow T_{merge}$
 6:     for each subtree $T_{sub}$ of $T_x$ do
 7:         Let $T_{swapped}$ be the tree constructed by swapping $T_{sub}$ with $T_y$.
 8:         if the number of triplets consistent with $T_{swapped}$ is larger than the number of triplets consistent with $T_{best}$ then
 9:             $T_{best} \leftarrow T_{swapped}$
10:         end if
11:     end for
12:     for each subtree $T_{sub}$ of $T_y$ do
13:         Let $T_{swapped}$ be the tree constructed by swapping $T_{sub}$ with $T_x$.
14:         if the number of triplets consistent with $T_{swapped}$ is larger than the number of triplets consistent with $T_{best}$ then
15:             $T_{best} \leftarrow T_{swapped}$
16:         end if
17:     end for
18:     Add $T_{best}$ to $T$.
19: end while
20: return the tree in $T$
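Steps 4-17 of Algorithm 2, for one already selected pair of trees, can be sketched as below. The sketch is illustrative only and rests on assumptions that are not stated in the paper: trees are nested tuples, triplets $ij|k$ are tuples `(i, j, k)`, the function names are hypothetical, only proper subtrees are swapped (swapping a whole tree with its partner reproduces the plain merge), and consistency is counted naively instead of with constant-time LCA queries.

```python
def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path of child indices."""
    if isinstance(tree, str):
        return {tree: prefix}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (index,)))
    return paths


def common_prefix(a, b):
    """Length of the shared prefix of two paths, i.e. the depth of the LCA."""
    depth = 0
    for step_a, step_b in zip(a, b):
        if step_a != step_b:
            break
        depth += 1
    return depth


def count_consistent(tree, triplets):
    """Number of triplets ij|k, with all leaves in the tree, that it satisfies."""
    paths = leaf_paths(tree)
    return sum(
        1 for i, j, k in triplets
        if i in paths and j in paths and k in paths
        and common_prefix(paths[i], paths[j]) > common_prefix(paths[i], paths[k])
    )


def replacements(host, other):
    """Yield (host with one proper subtree replaced by other, displaced subtree)."""
    if isinstance(host, str):
        return
    left, right = host
    yield (other, right), left
    yield (left, other), right
    for new_left, displaced in replacements(left, other):
        yield (new_left, right), displaced
    for new_right, displaced in replacements(right, other):
        yield (left, new_right), displaced


def merge_with_total_reconstruction(t_x, t_y, triplets):
    """Return the best topology among the plain merge and all subtree swaps."""
    candidates = [(t_x, t_y)]                      # the regular merge
    candidates += list(replacements(t_x, t_y))     # swap subtrees of t_x with t_y
    candidates += list(replacements(t_y, t_x))     # swap subtrees of t_y with t_x
    best, best_score = candidates[0], count_consistent(candidates[0], triplets)
    for candidate in candidates[1:]:
        score = count_consistent(candidate, triplets)
        if score > best_score:                     # strict improvement only
            best, best_score = candidate, score
    return best
```

With `t_x = ("i", "k")`, `t_y = "j"` and the single triplet `ij|k`, the plain merge satisfies nothing, while swapping `j` with the right subtree of `t_x` gives `(("i", "j"), "k")`, which satisfies the triplet. This is the BPMR example from the text. The deeper swaps produced by the recursive calls in `replacements` are what distinguish BPMTR from BPMR.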
Table 2. Comparison of BPMTR with BPMR

No. of species and triplets    % better results    % worse results
n=20, m=500                    29%                 0.0%
n=20, m=1000                   37%                 1%
n=30, m=500                    61%                 3%
n=30, m=1000                   62%                 4%


We tested BPMTR on randomly generated triplet sets with $n = 20, 30$ species and $m = 500, 1000$ triplets. We ran one hundred experiments for each combination of $n$ and $m$. The results in Tab. 2 indicate that BPMTR outperforms BPMR. However, among these hundreds of tests there were a few examples in which BPMR performed better than BPMTR. For $n = 30$ and $m = 1000$, BPMTR satisfied more triplets in sixty-two out of one hundred randomly generated triplet sets. In thirty-four triplet sets, BPMR and BPMTR had the same results, and in only four triplet sets BPMR satisfied more triplets.

5. Conclusion and Open Problems

In this paper we presented two new algorithms for the MaxRTC problem. For a given set of $m$ triplets on $n$ species, the FastTree algorithm runs in $O(m + \alpha(n)n^2)$ time, which is faster than any previously known algorithm, although the outcome can be less satisfactory for highly sparse triplet sets. The BPMTR algorithm runs in $O(mn^3)$ time and on average performs better than any previously known approximation algorithm for this problem. There are still more ideas for improving the described algorithms.

1. In the FastTree algorithm, to compute the closeness of pairs of species we check the triplets, and for each triplet of the form $ij|k$ we add a weight $w$ to $C_{ij}$ and subtract a penalty $p$ from $C_{ik}$ and $C_{jk}$. In this paper, we set $w = p = 1$. If one assigns different values to $w$ and $p$, the closeness of pairs of species will change and the reconstruction order will be affected. It would be interesting to check for which values of $w$ and $p$ FastTree performs better.

2. Wu [11] introduced six alternatives for e_score, each of which performs better for different input triplet sets. It would be interesting to find a new function outperforming all the alternatives for any input triplet set.

3. The best known approximation factor for the MaxRTC problem is 3 [13]. This is the approximation ratio of BPMF. Since MaxRTC is APX-hard, a PTAS is unattainable unless P=NP. However, [5] suggests that an approximation ratio in the region of 1.2 might be possible. Finding an $\alpha$-approximation algorithm for MaxRTC with $\alpha < 3$ is still open.

4. It would also be interesting to find the approximation ratio of FastTree, both in general and for reflective triplet sets.